An introduction to resampling methods
Reference: Data Science Chapter 5
From the New England Journal of Medicine in 2006:
We randomly assigned patients with resectable adenocarcinoma of the stomach, esophagogastric junction, or lower esophagus to either perioperative chemotherapy and surgery (250 patients) or surgery alone (253 patients)…. With a median follow-up of four years, 149 patients in the perioperative-chemotherapy group and 170 in the surgery group had died. As compared with the surgery group, the perioperative-chemotherapy group had a higher likelihood of overall survival (five-year survival rate, 36 percent vs. 23 percent).
Conclusion:
Not so fast! In statistics, we ask “what if?” a lot:
Always remember two basic facts about samples:
By “quantifying uncertainty,” we mean filling in the blanks.
In stats, we equate trustworthiness with stability:
\[ \begin{array}{r} \mbox{Confidence in} \\ \mbox{your estimates} \\ \end{array} \iff \begin{array}{l} \mbox{Stability of those estimates} \\ \mbox{under the influence of chance} \\ \end{array} \]
For example:
Let's work through a thought experiment…
Imagine Andrey Kolmogorov on a four-day fishing trip.
(Figure: the sampling distributions of \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \).)
Suppose we are trying to estimate some population-level quantity \( \theta \): the parameter of interest.
So we take a sample from the population: \( X_1, X_2, \ldots, X_N \).
We use the data to form an estimate \( \hat{\theta}_N \) of the parameter. Key insight: \( \hat{\theta}_N \) is a random variable.
Now imagine repeating this process thousands of times! Since \( \hat{\theta}_N \) is a random variable, it has a probability distribution.
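We can sketch this repeated-sampling idea in a short simulation. The population, sample size, and choice of estimator below are my own illustrative assumptions (a hypothetical skewed "income" population, with the sample mean playing the role of \( \hat{\theta}_N \)):

```python
import random
import statistics

random.seed(42)

# Hypothetical population: exponential "incomes" with mean theta = 40000.
N = 50       # sample size
reps = 5000  # number of repeated samples

# Each repetition: draw a fresh sample of size N, compute theta_hat.
theta_hats = []
for _ in range(reps):
    sample = [random.expovariate(1 / 40000) for _ in range(N)]
    theta_hats.append(statistics.mean(sample))

# theta_hats holds 5000 draws from the sampling distribution of the mean:
# a different answer every time, clustered around the truth.
print(statistics.mean(theta_hats))   # centers near theta = 40000
print(statistics.stdev(theta_hats))  # the spread is the standard error
```

A histogram of `theta_hats` would be a picture of the sampling distribution itself.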
Estimator: any method for estimating the value of a parameter (e.g. sample mean, sample proportion, slope of OLS line, etc).
Sampling distribution: the probability distribution of an estimator \( \hat{\theta}_N \) under repeated samples of size \( N \).
Bias: Let \( \bar{\theta}_N = E(\hat{\theta}_N) \) be the mean of the sampling distribution. The bias of \( \hat{\theta}_N \) is \( (\bar{\theta}_N - \theta) \): the difference between the average answer and the truth.
Unbiased estimator: \( (\bar{\theta}_N - \theta) = 0 \).
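To make bias concrete, here is a small simulation (the normal population and the choice of variance as the parameter are my assumptions, not from the text) comparing the plug-in variance estimator, which divides by \( N \), with the usual sample variance, which divides by \( N - 1 \):

```python
import random
import statistics

random.seed(1)

N = 10
reps = 20000
true_var = 4.0  # population is Normal(0, sd=2), so theta = variance = 4

biased, unbiased = [], []
for _ in range(reps):
    x = [random.gauss(0, 2) for _ in range(N)]
    m = statistics.fmean(x)
    ss = sum((xi - m) ** 2 for xi in x)
    biased.append(ss / N)          # plug-in estimator: divide by N
    unbiased.append(ss / (N - 1))  # sample variance: divide by N - 1

# Bias = mean of the sampling distribution minus the truth.
print(statistics.fmean(biased) - true_var)    # near -true_var / N, i.e. negative
print(statistics.fmean(unbiased) - true_var)  # near 0: unbiased
```

The divide-by-\( N \) estimator systematically undershoots the truth; dividing by \( N - 1 \) removes that bias.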
Standard error: the standard deviation of an estimator's sampling distribution:
\[ \begin{aligned} \mbox{se}(\hat{\theta}_N) &= \sqrt{ \mbox{var}(\hat{\theta}_N) } \\ &= \sqrt{ E[ (\hat{\theta}_N - \bar{\theta}_N )^2] } \\ &= \mbox{Typical deviation of $\hat{\theta}_N$ from its average} \end{aligned} \]
“If I were to take repeated samples from the population and use this estimator for every sample, how much does the answer vary, on average?”
If an estimator is unbiased, then \( \bar{\theta}_N = \theta \), so
\[ \begin{aligned} \mbox{se}(\hat{\theta}_N) &= \sqrt{ E[ (\hat{\theta}_N - \bar{\theta}_N )^2] } \\ &= \sqrt{ E[ (\hat{\theta}_N - \theta )^2] } \\ &= \mbox{Typical deviation of $\hat{\theta}_N$ from the truth} \end{aligned} \]
“If I were to take repeated samples from the population and use this estimator for every sample, how big of an error do I make, on average?”
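The two readings of the standard error agree in simulation. A sketch, under my own assumptions (the sample mean of a normal population with known \( \sigma \), so the classical answer \( \sigma/\sqrt{N} \) is available to check against):

```python
import math
import random
import statistics

random.seed(7)

N = 25
sigma = 10.0
reps = 10000

# Repeated samples from Normal(mu=100, sd=sigma); estimator = sample mean.
theta_hats = [statistics.fmean(random.gauss(100, sigma) for _ in range(N))
              for _ in range(reps)]

se_simulated = statistics.stdev(theta_hats)  # spread of the sampling distribution
se_theory = sigma / math.sqrt(N)             # classical formula for the mean

print(se_simulated, se_theory)  # the two should be close
```

Since the sample mean is unbiased, this one number answers both questions at once: how much the answer varies, and how big an error we typically make.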